{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  Runtimes: What are my options? How do I choose?\n",
    "\n",
    "Remember that TensorRT consists of two main components - __1. A series of parsers and integrations__ to convert your model to an optimized engine and __2. An series of TensorRT runtime APIs__ with several associated tools for deployment.\n",
    "\n",
    "In this notebook, we will focus on the latter - various runtime options for TensorRT engines.\n",
    "\n",
    "The runtimes have different use cases for running TRT engines. \n",
    "\n",
    "### Considerations when picking a runtime:\n",
    "\n",
    "Generally speaking, there are a few major considerations when picking a runtime:\n",
    "- __Framework__ - Some options, like TF-TRT, are only relevant to Tensorflow\n",
    "- __Time-to-solution__ - TF-TRT is much more likely to work 'out-of-the-box' if a quick solution is required and ONNX fails\n",
    "- __Serving needs__ - TF-TRT can use TF Serving to serve models over HTTP as a simple solution. For other frameworks (or for more advanced features) TRITON is framework agnostic, allows for concurrent model execution or multiple copies within a GPU to reduce latency, and can accept engines created through both the ONNX and TF-TRT paths\n",
    "- __Performance__ - Different TensorRT runtimes offer varying levels of performance. For example, TF-TRT is generally going to be slower than using ONNX or the C++ API directly.\n",
    "\n",
    "### Python API:\n",
    "\n",
    "__Use this when:__\n",
    "- You can accept some performance overhead, and\n",
    "- You are most familiar with Python, or\n",
    "- You are performing initial debugging and testing with TRT\n",
    "\n",
    "__More info:__\n",
    "\n",
    " \n",
    "The [TensorRT Python API](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform_inference_python) gives you fine-grained control over the execution of your engine using a Python interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance applications easier. It is also great for testing models in a Python environment - such as in a Jupyter notebook.\n",
    " \n",
    "The [ONNX notebook for Tensorflow](./3.%20Using%20Tensorflow%202%20through%20ONNX.ipynb) and [for PyTorch]() are good examples of using TensorRT to get great performance while staying in Python\n",
    "\n",
    "### C++ API: \n",
    "\n",
    "__Use this when:__\n",
    "- You want the least amount of overhead possible to maximize the performance of your models and achieve better latency\n",
    "- You are not using TF-TRT (though TF-TRT graph conversions that only generate a single engine can still be exported to C++)\n",
    "- You are most familiar with C++\n",
    "- You want to optimize your inference pipeline as much as possible\n",
    "\n",
    "__More info:__\n",
    "\n",
    "The [TensorRT C++ API](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#perform_inference_c) gives you fine-grained control over the execution of your engine using a C++ interface. It makes memory allocation, kernel execution, and copies to and from the GPU explicit - which can make integration into high performance C++ applications easier. The C++ API is generally the most performant option for running TensorRT engines, with the least overhead.\n",
    "\n",
    "[This NVIDIA Developer blog](https://developer.nvidia.com/blog/speed-up-inference-tensorrt/) is a good example of taking an ONNX model and running it with dynamic batch size support using the C++ API.\n",
    "\n",
    "\n",
    "### Tensorflow/TF-TRT Runtime: (Tensorflow Only) \n",
    "    \n",
    "__Use this when:__\n",
    "    \n",
    "- You are using TF-TRT, and\n",
    "- Your model converts to more than one TensorRT engine\n",
    "\n",
    "__More info:__\n",
    "\n",
    "\n",
    "TF-TRT is the standard runtime used with models that were converted in TF-TRT. It works by taking groups of nodes at once in the Tensorflow graph, and replacing them with a singular optimized engine that calls the TensorRT Python API behind the scenes. This optimized engine is in the form of a Tensorflow operation - which means that your graph is still in Tensorflow and will essentially function like any other Tensorflow model. For example, it can be a useful exercise to take a look at your model in Tensorboard to validate which nodes TensorRT was able to optimize.\n",
    "\n",
    "If your graph entirely converts to a single TF-TRT engine, it can be more efficient to export the engine node and run it using one of the other APIs. You can find instructions to do this in the [TF-TRT documentation](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html#tensorrt-plan).\n",
    "\n",
    "As an example, the TF-TRT notebooks included with this guide use the TF-TRT runtime.\n",
    "\n",
    "###  TRITON Inference Server\n",
    "\n",
    "__Use this when:__\n",
    "- You want to serve your models over HTTP or gRPC\n",
    "- You want to load balance across multiple models or copies of models across GPUs to minimze latency and make better use of the GPU\n",
    "- You want to have multiple models running efficiently on a single GPU at the same time\n",
    "- You want to serve a variety of models converted using a variety of converters and frameworks (including TF-TRT and ONNX) through a uniform interface\n",
    "- You need serving support but are using PyTorch, another framework, or the ONNX path in general\n",
    "\n",
    "__More info:__\n",
    "\n",
    "\n",
    "TRITON is an open source inference serving software that lets teams deploy trained AI models from any framework (TensorFlow, TensorRT, PyTorch, ONNX Runtime, or a custom framework), from local storage or Google Cloud Platform or AWS S3 on any GPU- or CPU-based infrastructure (cloud, data center, or edge). It is a flexible project with several unique features - such as concurrent model execution of both heterogeneous models and multiple copies of the same model (multiple model copies can reduce latency further) as well as load balancing and model analysis. It is a good option if you need to serve your models over HTTP - such as in a cloud inferencing solution.\n",
    "    \n",
    "You can find the TRITON home page [here](https://developer.nvidia.com/nvidia-triton-inference-server), and the documentation [here](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
